Olympics

In this study I interrogate data from the Olympic Games to extract some interesting insights. The data come from kaggle/piterfm and from the 126 Years of Historical Olympic Dataset. Although the data cover the period from 1896 onwards, this study focuses on the Games held after the dissolution of the USSR: otherwise the results of the USSR team would appear up to the dissolution and, afterwards, those of all the newly formed states. To avoid that, and to reduce the amount of data to process, I consider only the Olympic Games from the 1992 editions onwards. Unfortunately, the data have two limitations:

  • There are no results for qualification rounds. For instance, the men’s 100 m event contains only the final results, without semi-finals and other heats;
  • There is no information about individual athletes for team competitions with more than 2 participants, only team records.

Env SetUp

Colors

Graphic functions

Dataset

These datasets present a detailed country-wise record of Olympic medals from the first modern Olympics in 1896 to the most recent games in 2024. They provide insights into how different nations have performed over time, including their gold, silver, and bronze medal counts, overall rankings, and total medal tally.

This last dataset contains the data from Olympedia and provides some additional information about the athletes who competed in the Olympic Games.

Data Preprocessing

Athlete

## spc_tbl_ [75,904 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ athlete_url         : chr [1:75904] "https://olympics.com/en/athletes/cooper-woods-topalovic" "https://olympics.com/en/athletes/elofsson" "https://olympics.com/en/athletes/dylan-walczyk" "https://olympics.com/en/athletes/olli-penttala" ...
##  $ athlete_full_name   : chr [1:75904] "Cooper WOODS-TOPALOVIC" "Felix ELOFSSON" "Dylan WALCZYK" "Olli PENTTALA" ...
##  $ games_participations: num [1:75904] 1 2 1 1 1 3 2 2 1 2 ...
##  $ first_game          : chr [1:75904] "Beijing 2022" "PyeongChang 2018" "Beijing 2022" "Beijing 2022" ...
##  $ athlete_year_birth  : num [1:75904] 2000 1995 1993 1995 1989 ...
##  $ athlete_medals      : chr [1:75904] NA NA NA NA ...
##  $ bio                 : chr [1:75904] NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   athlete_url = col_character(),
##   ..   athlete_full_name = col_character(),
##   ..   games_participations = col_double(),
##   ..   first_game = col_character(),
##   ..   athlete_year_birth = col_double(),
##   ..   athlete_medals = col_character(),
##   ..   bio = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

I now apply the following operations to the dataset:

  • Keep only the athletes who competed in the Games from the 1992 editions onwards;
  • Split the column athlete_medals into gold, silver and bronze columns;
  • Add the debut age of each athlete;
  • Calculate the current age (as of 2025) of each athlete;
  • Add a binary variable, medalist, set to 1 if the athlete has won any medal in their career;
  • Adjust the format of some variables and select only the relevant ones.
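The derived variables in the list above can be sketched in base R as follows. This is only an illustration on a toy table, not the code behind this report: the rows are invented, the gold, silver and bronze columns are assumed to have already been split out of athlete_medals (whose raw format is not shown here), and the helper column first_year is hypothetical. The 1992 filter is left out because it depends on how athletes are matched to the results.

```r
# Toy stand-in for the athlete table (values are invented for illustration).
athlete <- data.frame(
  athlete_full_name  = c("A ONE", "B TWO", "C THREE"),
  first_game         = c("Beijing 2022", "Barcelona 1992", "Seoul 1988"),
  athlete_year_birth = c(2000, 1970, 1965),
  gold   = c(0, 1, 0),  # assumed already extracted from athlete_medals
  silver = c(0, 0, 2),
  bronze = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# The debut year is the trailing four-digit number of first_game
# (first_year is a hypothetical helper column).
athlete$first_year <- as.numeric(sub("^.* ([0-9]{4})$", "\\1", athlete$first_game))

# Debut age and current age (reference year 2025, as stated in the text).
athlete$debut_age <- athlete$first_year - athlete$athlete_year_birth
athlete$age       <- 2025 - athlete$athlete_year_birth

# medalist is 1 when the athlete has at least one medal of any colour.
athlete$medalist <- as.integer(athlete$gold + athlete$silver + athlete$bronze > 0)
```

Because debut_age is a plain difference of two recorded years, any error in either year propagates directly into it.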

Athlete additional

## spc_tbl_ [75,904 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ athlete_full_name: chr [1:75904] "Dylan WALCZYK" "Dmitriy REIKHERD" "Felix ELOFSSON" "Olli PENTTALA" ...
##  $ athlete_url      : chr [1:75904] "https://olympics.com/en/athletes/dylan-walczyk" "https://olympics.com/en/athletes/reikherd" "https://olympics.com/en/athletes/elofsson" "https://olympics.com/en/athletes/olli-penttala" ...
##  $ sex              : chr [1:75904] "Male" NA "Male" "Male" ...
##  $ height           : chr [1:75904] NA NA "184 cm" NA ...
##  $ weight           : chr [1:75904] NA NA "84 kg" NA ...
##  $ NOC              : chr [1:75904] "United States" NA "Sweden" "Finland" ...
##  $ NOC_code         : chr [1:75904] "USA" NA "SWE" "FIN" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   athlete_full_name = col_character(),
##   ..   athlete_url = col_character(),
##   ..   sex = col_character(),
##   ..   height = col_character(),
##   ..   weight = col_character(),
##   ..   NOC = col_character(),
##   ..   NOC_code = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

I’ll apply some necessary format corrections to the dataset to simplify later operations.
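As an example of such a correction, height and weight are stored as text ("184 cm", "84 kg") and need to become numeric. The sketch below assumes the unit suffix is always " cm" or " kg" and runs on an invented four-row table called athlete_additional; the actual corrections applied in this report may differ.

```r
# Toy stand-in for the additional athlete table (values are invented).
athlete_additional <- data.frame(
  sex    = c("Male", NA, "Male", "Female"),
  height = c(NA, NA, "184 cm", "170 cm"),
  weight = c(NA, NA, "84 kg", "62 kg"),
  stringsAsFactors = FALSE
)

# Strip the unit suffix and convert to numbers; NA stays NA.
athlete_additional$height <- as.integer(sub(" cm$", "", athlete_additional$height))
athlete_additional$weight <- as.numeric(sub(" kg$", "", athlete_additional$weight))

# Store sex as a factor with explicit levels.
athlete_additional$sex <- factor(athlete_additional$sex, levels = c("Female", "Male"))
```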

Medals

## spc_tbl_ [21,697 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ discipline_title     : chr [1:21697] "Curling" "Curling" "Curling" "Curling" ...
##  $ slug_game            : chr [1:21697] "beijing-2022" "beijing-2022" "beijing-2022" "beijing-2022" ...
##  $ event_title          : chr [1:21697] "Mixed Doubles" "Mixed Doubles" "Mixed Doubles" "Mixed Doubles" ...
##  $ event_gender         : chr [1:21697] "Mixed" "Mixed" "Mixed" "Mixed" ...
##  $ medal_type           : chr [1:21697] "GOLD" "GOLD" "SILVER" "SILVER" ...
##  $ participant_type     : chr [1:21697] "GameTeam" "GameTeam" "GameTeam" "GameTeam" ...
##  $ participant_title    : chr [1:21697] "Italy" "Italy" "Norway" "Norway" ...
##  $ athlete_url          : chr [1:21697] "https://olympics.com/en/athletes/stefania-constantini" "https://olympics.com/en/athletes/amos-mosaner" "https://olympics.com/en/athletes/kristin-skaslien" "https://olympics.com/en/athletes/magnus-nedregotten" ...
##  $ athlete_full_name    : chr [1:21697] "Stefania CONSTANTINI" "Amos MOSANER" "Kristin SKASLIEN" "Magnus NEDREGOTTEN" ...
##  $ country_name         : chr [1:21697] "Italy" "Italy" "Norway" "Norway" ...
##  $ country_code         : chr [1:21697] "IT" "IT" "NO" "NO" ...
##  $ country_3_letter_code: chr [1:21697] "ITA" "ITA" "NOR" "NOR" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   discipline_title = col_character(),
##   ..   slug_game = col_character(),
##   ..   event_title = col_character(),
##   ..   event_gender = col_character(),
##   ..   medal_type = col_character(),
##   ..   participant_type = col_character(),
##   ..   participant_title = col_character(),
##   ..   athlete_url = col_character(),
##   ..   athlete_full_name = col_character(),
##   ..   country_name = col_character(),
##   ..   country_code = col_character(),
##   ..   country_3_letter_code = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

I’ll apply some necessary format corrections to the dataset to simplify later operations.

Results

## spc_tbl_ [162,804 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ discipline_title     : chr [1:162804] "Curling" "Curling" "Curling" "Curling" ...
##  $ event_title          : chr [1:162804] "Mixed Doubles" "Mixed Doubles" "Mixed Doubles" "Mixed Doubles" ...
##  $ slug_game            : chr [1:162804] "beijing-2022" "beijing-2022" "beijing-2022" "beijing-2022" ...
##  $ participant_type     : chr [1:162804] "GameTeam" "GameTeam" "GameTeam" "GameTeam" ...
##  $ medal_type           : chr [1:162804] "GOLD" "SILVER" "BRONZE" NA ...
##  $ athletes             : chr [1:162804] "[('Stefania CONSTANTINI', 'https://olympics.com/en/athletes/stefania-constantini'), ('Amos MOSANER', 'https://o"| __truncated__ "[('Kristin SKASLIEN', 'https://olympics.com/en/athletes/kristin-skaslien'), ('Magnus NEDREGOTTEN', 'https://oly"| __truncated__ "[('Almida DE VAL', 'https://olympics.com/en/athletes/almida-de-val'), ('Oskar ERIKSSON', 'https://olympics.com/"| __truncated__ "[('Jennifer DODDS', 'https://olympics.com/en/athletes/jennifer-dodds'), ('Bruce MOUAT', 'https://olympics.com/e"| __truncated__ ...
##  $ rank_equal           : logi [1:162804] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rank_position        : chr [1:162804] "1" "2" "3" "4" ...
##  $ country_name         : chr [1:162804] "Italy" "Norway" "Sweden" "Great Britain" ...
##  $ country_code         : chr [1:162804] "IT" "NO" "SE" "GB" ...
##  $ country_3_letter_code: chr [1:162804] "ITA" "NOR" "SWE" "GBR" ...
##  $ athlete_url          : chr [1:162804] NA NA NA NA ...
##  $ athlete_full_name    : chr [1:162804] NA NA NA NA ...
##  $ value_unit           : chr [1:162804] NA NA NA NA ...
##  $ value_type           : chr [1:162804] NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   discipline_title = col_character(),
##   ..   event_title = col_character(),
##   ..   slug_game = col_character(),
##   ..   participant_type = col_character(),
##   ..   medal_type = col_character(),
##   ..   athletes = col_character(),
##   ..   rank_equal = col_logical(),
##   ..   rank_position = col_character(),
##   ..   country_name = col_character(),
##   ..   country_code = col_character(),
##   ..   country_3_letter_code = col_character(),
##   ..   athlete_url = col_character(),
##   ..   athlete_full_name = col_character(),
##   ..   value_unit = col_character(),
##   ..   value_type = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

I’ll apply some necessary format corrections to the dataset to simplify later operations.

Host

## spc_tbl_ [53 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ game_slug      : chr [1:53] "beijing-2022" "tokyo-2020" "pyeongchang-2018" "rio-2016" ...
##  $ game_end_date  : POSIXct[1:53], format: "2022-02-20 12:00:00" "2021-08-08 14:00:00" ...
##  $ game_start_date: POSIXct[1:53], format: "2022-02-04 15:00:00" "2021-07-23 11:00:00" ...
##  $ game_location  : chr [1:53] "China" "Japan" "Republic of Korea" "Brazil" ...
##  $ game_name      : chr [1:53] "Beijing 2022" "Tokyo 2020" "PyeongChang 2018" "Rio 2016" ...
##  $ game_season    : chr [1:53] "Winter" "Summer" "Winter" "Summer" ...
##  $ game_year      : num [1:53] 2022 2020 2018 2016 2014 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   game_slug = col_character(),
##   ..   game_end_date = col_datetime(format = ""),
##   ..   game_start_date = col_datetime(format = ""),
##   ..   game_location = col_character(),
##   ..   game_name = col_character(),
##   ..   game_season = col_character(),
##   ..   game_year = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

I’ll apply some necessary format corrections to the dataset to simplify later operations.

Merging dataset

Missing values & outliers

Athlete

##   athlete_id        games_participations  first_game        athlete_year_birth
##  Length:41886       Min.   : 1.000       Length:41886       Min.   :1891      
##  Class :character   1st Qu.: 1.000       Class :character   1st Qu.:1972      
##  Mode  :character   Median : 1.000       Mode  :character   Median :1981      
##                     Mean   : 1.665                          Mean   :1981      
##                     3rd Qu.: 2.000                          3rd Qu.:1990      
##                     Max.   :10.000                          Max.   :2009      
##                                                             NA's   :247       
##       gold             silver           bronze         debut_age     
##  Min.   : 0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :-61.00  
##  1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 21.00  
##  Median : 0.0000   Median :0.0000   Median :0.0000   Median : 24.00  
##  Mean   : 0.1028   Mean   :0.1001   Mean   :0.1058   Mean   : 24.45  
##  3rd Qu.: 0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.: 27.00  
##  Max.   :23.0000   Max.   :6.0000   Max.   :6.0000   Max.   :122.00  
##                                                      NA's   :247     
##       age            medalist     athlete_full_name      sex       
##  Min.   : 16.00   Min.   :0.000   Length:41886       Female:15087  
##  1st Qu.: 35.00   1st Qu.:0.000   Class :character   Male  :24146  
##  Median : 44.00   Median :0.000   Mode  :character   NA's  : 2653  
##  Mean   : 44.14   Mean   :0.186                                    
##  3rd Qu.: 53.00   3rd Qu.:0.000                                    
##  Max.   :134.00   Max.   :1.000                                    
##  NA's   :247                                                       
##      height          weight          NOC_code    
##  Min.   :133.0   Min.   : 28.00   USA    : 2446  
##  1st Qu.:168.0   1st Qu.: 60.00   CAN    : 1606  
##  Median :175.0   Median : 69.00   GER    : 1568  
##  Mean   :175.1   Mean   : 70.33   FRA    : 1499  
##  3rd Qu.:182.0   3rd Qu.: 79.00   JPN    : 1423  
##  Max.   :217.0   Max.   :214.00   (Other):30690  
##  NA's   :9502    NA's   :9502     NA's   : 2654
##           athlete_id games_participations           first_game 
##          0.000000000          0.000000000          0.000000000 
##   athlete_year_birth                 gold               silver 
##          0.005896958          0.000000000          0.000000000 
##               bronze            debut_age                  age 
##          0.000000000          0.005896958          0.005896958 
##             medalist    athlete_full_name                  sex 
##          0.000000000          0.000000000          0.063338586 
##               height               weight             NOC_code 
##          0.226853841          0.226853841          0.063362460

From the analysis of the missing data and the distribution of the variables, some problems stand out:

  • Some athletes debut even before they were born;
  • Some athletes debut after the age of 75;
  • Some athletes attended more than 5 editions, even reaching 10 (at least 40 years of activity).

The first two problems probably come from an error in the data collection process, given that the debut age is calculated by subtracting the birth year of an athlete from the debut year. Looking at the boxplot of athlete_year_birth, it is possible to see an evident cluster of records that lies outside any plausible rare case (born before 1920). The problem could lie either in the collection of athlete_year_birth or in the collection of first_game. To verify the latter possibility, I cross-check the athletes born before 1950 in the athlete.full dataset against the result.full dataset.

Check for the oldest athletes

Check for the “super premature” athletes

## [1] 0.9125

The analysis of the birth years and of the first Games’ year revealed that:

  • For the oldest athletes, those who have a correspondence in the results dataset, the 1992 edition was probably one of the latest, if not the last, and there is no evidence of errors;
  • For the youngest, among the athletes with a recorded debut age of less than 12, 91% have no correspondence in the results of the first Olympics recorded for them.

The decision is to delete the athletes born before 1950 who have no correspondence in the results, together with the athletes with debut_age <= 12.
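The deletion rule can be sketched as below. The table, the ids and the vector results_ids (ids that appear in the results) are invented for illustration, and the sketch keeps rows whose birth year is missing rather than dropping them, which is an assumption on my side.

```r
# Toy athlete table (values are invented).
athlete_full <- data.frame(
  athlete_id         = c("a1", "a2", "a3", "a4"),
  athlete_year_birth = c(1930, 1940, 1985, 1990),
  debut_age          = c(62, 52, 7, 22),
  stringsAsFactors = FALSE
)

# Hypothetical set of athlete ids that appear in the results table.
results_ids <- c("a1", "a3", "a4")

# Rule 1: born before 1950 and without a correspondence in the results.
old_unmatched <- athlete_full$athlete_year_birth < 1950 &
  !(athlete_full$athlete_id %in% results_ids)

# Rule 2: implausibly young at debut.
too_young <- athlete_full$debut_age <= 12

# Rows with a missing birth year yield NA here; they are kept, not dropped.
drop <- old_unmatched | too_young
drop[is.na(drop)] <- FALSE

athlete_clean <- athlete_full[!drop, ]
```

In this toy case a1 is kept because it is matched in the results, a2 and a3 are removed, and a4 is unaffected.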

Regarding games_participations, I have checked and, even if very rare, there have been some athletes with very long careers, so nothing seems to be out of the ordinary.

Regarding missing values, the ones coming from the athlete dataset are very few (less than 1%), while those coming from athlete.additional were probably not available at the source. I save the indexes of those missing values to keep track of them and to deal with them later, according to the needs of the analysis.
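The share of missing values per column, and the row indexes saved for later, can be obtained along these lines. The data frame df and its values are invented; the object names na_share and na_index are hypothetical.

```r
# Toy table with a few missing values (values are invented).
df <- data.frame(
  height = c(180, NA, 175, NA),
  sex    = c("Male", "Female", NA, "Male"),
  gold   = c(0, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Proportion of NA per column, in the same form as the tables printed above.
na_share <- colMeans(is.na(df))

# Row indexes of the missing values, one integer vector per column.
na_index <- lapply(df, function(col) which(is.na(col)))
```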

Medals

##  discipline_title         slug_game       event_title      event_gender 
##         0.0000000         0.0000000         0.0000000         0.0000000 
##        medal_type  participant_type          NOC_code        athlete_id 
##         0.0000000         0.0000000         0.0000000         0.1951546 
## athlete_full_name participant_title     game_end_date   game_start_date 
##         0.1648454         0.7445361         0.0000000         0.0000000 
##     game_location         game_name       game_season         game_year 
##         0.0000000         0.0000000         0.0000000         0.0000000

The analysis of the missing values shows that the vast majority of them are in the participant_title variable, but that is not a problem, given that most of the values of participant_title are simply the full name corresponding to NOC_code. It also does not seem to be a relevant variable for any of the aims of the analysis. The medal records that have no athlete associated with them (the 1599 without both athlete_full_name and athlete_id) are related to medals with participant_type “GameTeam” and more than 2 members, so I’ll set athlete_full_name and athlete_id to “discipline_title-event_title-Team 2+”. Finally, for the records that have an NA only in athlete_id, the idea is to check the athlete.full dataset and see whether there is a unique correspondence between the name and the id.
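A minimal sketch of these two steps follows. It reads “discipline_title-event_title-Team 2+” as the values of the two columns pasted together with the literal suffix, which is my interpretation; the tables and names are invented, and the toy lookup deliberately contains one match so that the mechanism is visible.

```r
# Toy medals table (values are invented).
medals <- data.frame(
  discipline_title  = c("Curling", "Ice Hockey", "Luge"),
  event_title       = c("Mixed Doubles", "Men", "Singles men"),
  athlete_id        = c("x1", NA, NA),
  athlete_full_name = c("First NAME", NA, "Some NAME"),
  stringsAsFactors = FALSE
)

# Step 1: team medals with neither id nor name get a placeholder label.
both_missing <- is.na(medals$athlete_id) & is.na(medals$athlete_full_name)
team_label   <- paste(medals$discipline_title, medals$event_title, "Team 2+", sep = "-")
medals$athlete_id[both_missing]        <- team_label[both_missing]
medals$athlete_full_name[both_missing] <- team_label[both_missing]

# Step 2: rows with a name but no id are looked up by name,
# using only names that occur exactly once in the athlete table.
athlete_full <- data.frame(
  athlete_full_name = c("Some NAME", "Dup NAME", "Dup NAME"),
  athlete_id        = c("s1", "d1", "d2"),
  stringsAsFactors = FALSE
)
name_counts <- table(athlete_full$athlete_full_name)
candidates  <- athlete_full[athlete_full$athlete_full_name %in%
                              names(name_counts)[name_counts == 1], ]

id_missing <- is.na(medals$athlete_id)
medals$athlete_id[id_missing] <- candidates$athlete_id[
  match(medals$athlete_full_name[id_missing], candidates$athlete_full_name)
]
```

Names that are not found, or that are ambiguous, keep an NA id after the lookup.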

## [1] 212

It seems that none of the 212 athlete names without an ID is present in the athlete.full dataset. For the medals without an associated athlete, I save the indexes simply to keep track of which ones they are.

Results

##  discipline_title       event_title         slug_game  participant_type 
##       0.000000000       0.000000000       0.000000000       0.000000000 
##        medal_type        rank_equal     rank_position        value_unit 
##       0.887398841       0.759610807       0.008963507       0.606917691 
##        value_type          NOC_code athlete_full_name        athlete_id 
##       0.547366099       0.000000000       0.073867661       0.105425709 
##     game_end_date   game_start_date     game_location         game_name 
##       0.000000000       0.000000000       0.000000000       0.000000000 
##       game_season         game_year 
##       0.000000000       0.000000000

In this dataset the missing values have to be treated according to the type of variable. For example, in the variable medal_type a missing value probably stands for None, also considering that ~90% of the values are missing. On the other hand, the co-occurrence graph looks very messy. I’ll try some other visualizations of the missing values to understand more.

With this additional analysis it seems that the missing values in the variable rank_equal correspond to FALSE. For the variables value_type and value_unit there does not seem to be any significant pattern for the moment, probably just a lack of information from the source, but it is too soon to tell. Finally, for athlete_id I’ll try to match against the athlete.full dataset, as done with medals.full.

Now, leaving aside value_unit and value_type, the missing values in the dataset are:

  • medal_type: NA means that no medal has been won;
  • athlete_full_name and athlete_id at the same time: those records correspond to GameTeam results in which the group has more than 2 participants. I’ll set athlete_full_name and athlete_id to “discipline_title-event_title-Team 2+”;
  • athlete_id alone: NA means that the athlete has no id associated in the athlete.full dataset.

As before, I save the indexes of the data that are actually missing.
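The recoding of the missing values that carry a meaning (no medal, no tie) can be sketched like this. The toy results table is invented, and the label "None" follows the wording used in the text rather than any value confirmed in the data.

```r
# Toy results table (values are invented).
results <- data.frame(
  medal_type = c("GOLD", NA, NA),
  rank_equal = c(TRUE, NA, FALSE),
  stringsAsFactors = FALSE
)

# A missing medal_type means that no medal was won.
results$medal_type[is.na(results$medal_type)] <- "None"

# A missing rank_equal is read as "not tied".
results$rank_equal[is.na(results$rank_equal)] <- FALSE
```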

Hosts

##   slug_game         game_end_date                 game_start_date              
##  Length:17          Min.   :1992-02-23 19:00:00   Min.   :1992-02-08 07:00:00  
##  Class :character   1st Qu.:1998-02-22 11:00:00   1st Qu.:1998-02-06 23:00:00  
##  Mode  :character   Median :2006-02-26 19:00:00   Median :2006-02-10 07:00:00  
##                     Mean   :2006-07-23 06:07:03   Mean   :2006-07-07 03:03:31  
##                     3rd Qu.:2014-02-23 16:00:00   3rd Qu.:2014-02-07 04:00:00  
##                     Max.   :2022-02-20 12:00:00   Max.   :2022-02-04 15:00:00  
##  game_location       game_name         game_season   game_year   
##  Length:17          Length:17          Summer:8    Min.   :1992  
##  Class :character   Class :character   Winter:9    1st Qu.:1998  
##  Mode  :character   Mode  :character               Median :2006  
##                                                    Mean   :2006  
##                                                    3rd Qu.:2014  
##                                                    Max.   :2022
## tibble [17 × 7] (S3: tbl_df/tbl/data.frame)
##  $ slug_game      : chr [1:17] "beijing-2022" "tokyo-2020" "pyeongchang-2018" "rio-2016" ...
##  $ game_end_date  : POSIXct[1:17], format: "2022-02-20 12:00:00" "2021-08-08 14:00:00" ...
##  $ game_start_date: POSIXct[1:17], format: "2022-02-04 15:00:00" "2021-07-23 11:00:00" ...
##  $ game_location  : chr [1:17] "China" "Japan" "Republic of Korea" "Brazil" ...
##  $ game_name      : chr [1:17] "Beijing 2022" "Tokyo 2020" "PyeongChang 2018" "Rio 2016" ...
##  $ game_season    : Factor w/ 2 levels "Summer","Winter": 2 1 2 1 2 1 2 1 2 1 ...
##  $ game_year      : num [1:17] 2022 2020 2018 2016 2014 ...

Descriptive

Now I’ll visualize the data and try to extract some relevant insight.

## [1] "China TRUE"
## [1] "Japan TRUE"
## [1] "Republic of Korea FALSE"
## [1] "Brazil TRUE"
## [1] "Russian Federation FALSE"
## [1] "Great Britain FALSE"
## [1] "Canada TRUE"
## [1] "Italy TRUE"
## [1] "Greece TRUE"
## [1] "United States FALSE"
## [1] "Australia TRUE"
## [1] "Norway TRUE"
## [1] "Spain TRUE"
## [1] "France TRUE"

Since 1992 the Olympic Games have been hosted on every continent except Africa, with the USA, Japan and China each chosen twice as host country.

As shown by the image, in the period after 1992 the summer editions have been clearly dominated by, in order, the USA, China, Russia and Germany, followed by the UK, Australia, France, Japan and Italy. In the winter editions, instead, the podium changes somewhat, with Germany as the overall leader, followed by Norway and the USA. Canada, Austria, Russia and Italy have also achieved brilliant results.

Looking at how national performance has evolved over time in the summer editions, there is a remarkable uptrend for the UK, Japan and, to a lesser extent, China, while Germany shows a very significant downtrend. These patterns are not accidental: the rise of Great Britain is strongly linked to the consistent investment plan launched after Sydney 2000 and reinforced towards London 2012, while Japan’s progression is connected to the preparation for Tokyo 2020 and a long-term investment in sports science. China’s boost reflects the national strategy of using sport as a symbol of prestige, with a clear turning point around Beijing 2008. By contrast, Germany’s decline after the early ’90s is strongly related to the end of the centralized and highly efficient sports system of East Germany, which guaranteed very high results before reunification. Finally, the undisputed dominance of the USA across the years is evident; it rests on an enormous pool of athletes, strong university sports programs and the ability to remain competitive in a wide variety of disciplines. In the last 20 years, China has also emerged as a stable counterpart in the top positions.

In the winter Games, instead, a remarkable uptrend is shown by the USA and Canada and, in the last 20 years, by Norway, up to its current domination. These dynamics can be explained by the rise of new winter disciplines (snowboard, freestyle), where the USA has a cultural and commercial advantage, and by the Canadian program Own the Podium, launched for Vancouver 2010, which gave a permanent boost. Norway’s extraordinary growth is the result of a long-lasting tradition in Nordic sports, combined with an ethical and sustainable approach that avoided the collapse seen in other countries after doping scandals. Unlike the USA in the summer editions, Germany’s dominance here is not as evident. In fact, in the short term the top position has clearly shifted to Norway, but looking at the cumulative number of medals Germany still keeps the overall lead, a lead that nevertheless seems increasingly under threat and likely to be overtaken in the next editions.

Let’s inspect where the currently dominant nations get their strength from (the USA, China and the UK for the summer editions, Norway and Germany for the winter ones).

In the summer editions the USA has won, on average, ~11.5 medals per discipline, thanks to its absolute domination in athletics and swimming and, to a lesser extent, to artistic gymnastics and wrestling. This concentration is not accidental: athletics and swimming provide many medal opportunities, and the USA, with its huge pool of athletes, strong college sports system and world-class infrastructure, has been able to consistently turn quantity into quality. China, with a lower average of ~9.55 medals per discipline, has been competitive in athletics and swimming, though not as dominant as the USA, and has instead built its strength on a very wide variety of sports such as badminton, diving, shooting, table tennis, artistic gymnastics and weightlifting. This reflects a precise national strategy: investing in sports with many medal events or less global competition, supported by centralized talent identification and early specialization. The UK, with an even lower average of ~6 medals per discipline, has been competitive in athletics, track cycling, rowing, sailing and swimming. Its success is the result of deliberate investments after Sydney 2000, and especially in preparation for London 2012, focused on disciplines with a strong national tradition and where technology and innovation could make the difference.

In the winter editions a very similar situation stands out: Norway, with an average of ~13.28 medals per discipline, owes its growing dominance mainly to cross-country skiing, biathlon and alpine skiing, with some decent results also in Nordic combined, ski jumping and speed skating. The reason for such supremacy lies in the cultural roots of skiing in Norway, practised massively at every level, and in a federation system oriented towards long-term and sustainable athlete development. Germany, with an average of ~10 medals per discipline, has been less dominant than Norway but competitive across multiple disciplines such as alpine skiing, biathlon, bobsleigh, cross-country skiing, luge, Nordic combined, ski jumping and speed skating. The German model is diversified rather than concentrated, reflecting a strong tradition in sliding sports and a well-developed infrastructure, which has guaranteed a broad but less absolute competitiveness.

The individual performances of the athletes reflect what has been observed so far. For Norway, Marit Bjørgen (cross-country skiing), Ole Einar Bjørndalen (biathlon), Bjørn Dæhlie (cross-country skiing) and Kjetil André Aamodt (alpine skiing) have performed far beyond the ordinary, to the point of being statistical outliers, and their records symbolize the cultural centrality of winter sports in the country. Bjørgen is the most decorated Winter Olympian of all time, while Bjørndalen, known as the “King of Biathlon”, holds second place. Aamodt is still the most successful alpine skier in Olympic history. The USA shows a similar pattern of outstanding outliers, mainly in swimming and athletics: Michael Phelps is the most decorated Olympian ever, while Katie Ledecky has become the most dominant female swimmer in history. Allyson Felix is the most decorated American track and field athlete and, together with names like Ryan Lochte, Shannon Miller, Aaron Peirsol, Amanda Beard, Natalie Coughlin and Gary Hall Jr., embodies the American dominance across multiple cycles of the Games. For Germany and China, on the other hand, the distribution does not show such extreme outliers, even if remarkable athletes have left a significant mark: for Germany, Uschi Disl (biathlon) and Katja Seizinger (alpine skiing) are the most notable names, while many medals also came from team events in sliding sports. For China the dominance has been more distributed, reflecting the national strategy; nevertheless, some athletes stand out for their consistency across multiple Games. In diving, Wu Minxia, Guo Jingjing, Qin Kai, Fu Mingxia, Chen Ruolin and Cao Yuan embody the tradition of excellence that made China the global powerhouse in this discipline.
In addition, Wang Yifu in shooting and Sun Yang in swimming have been central figures of the Chinese success, showing how the country’s performance, though less concentrated on single “outliers” than that of the USA or Norway, has been built on a solid core of multi-medalists across different disciplines.

What does it take to win a medal?

In this part I explore, given the past performance of the athletes who took part in the Olympic Games, what it takes to be an Olympic medalist in the winter editions (like the 2026 Milano-Cortina Games). Then I’ll try to estimate the chances that the Italian athletes have of winning a medal.

Data preparation

I now create the full dataset that I’m going to use from here on, starting from the athlete.full and results.full datasets. I immediately remove the irrelevant variables and create some new “indicators” that might be useful in the analysis:

  • Medalist: whether the result corresponds to a medal or not;
  • HomeGame: whether the athlete was competing in their own nation or not;
  • participation_number: the cumulative number of participations for each athlete.

Furthermore, I’ll work only on the winter editions, in accordance with the aim of the study.
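The three indicators can be sketched in base R as follows. The toy table is invented, and the sketch assumes that game_location has already been converted to the same three-letter codes as NOC_code and that participation_number counts distinct Games per athlete in chronological order; the actual definitions used in this report may differ in detail.

```r
# Toy results table, one row per athlete and event (values are invented).
results <- data.frame(
  athlete_id    = c("a1", "a1", "a1", "a2", "a2"),
  game_year     = c(2014, 2018, 2018, 2006, 2022),
  game_location = c("RUS", "KOR", "KOR", "ITA", "CHN"),
  NOC_code      = c("NOR", "NOR", "NOR", "ITA", "ITA"),
  medal_type    = c(NA, "GOLD", NA, "BRONZE", NA),
  stringsAsFactors = FALSE
)

# Medalist: 1 when the result carries a medal.
results$Medalist <- as.integer(!is.na(results$medal_type))

# HomeGame: 1 when the athlete's NOC is the host country
# (NA if NOC_code is missing).
results$HomeGame <- as.integer(results$NOC_code == results$game_location)

# participation_number: position of the Games among the athlete's
# distinct editions, so two events at the same Games share one number.
results$participation_number <- as.integer(ave(
  results$game_year, results$athlete_id,
  FUN = function(y) match(y, sort(unique(y)))
))
```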

Winter

Data splitting

## 
##    0    1 
## 0.93 0.07
## 
##    0    1 
## 0.93 0.07
## 
##    0    1 
## 0.92 0.08

Having divided the dataset into the necessary subsets and verified that they have the same class proportions, I proceed with the analysis of the missing values.
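A split that preserves the class proportions can be sketched as below. The 70/15/15 train/validation/test shares, the helper stratified_split and the toy data are assumptions of mine; the report does not state which proportions or which function were used.

```r
set.seed(42)

# Toy data with roughly the class balance shown above (7% medalists).
winter <- data.frame(id = 1:1000, Medalist = rep(c(0, 1), times = c(930, 70)))

# Hypothetical helper: assigns each row to a partition, class by class,
# so that every partition keeps about the same share of each class.
stratified_split <- function(y, probs = c(train = 0.70, valid = 0.15, test = 0.15)) {
  part <- character(length(y))
  for (cls in unique(y)) {
    idx  <- which(y == cls)
    idx  <- idx[sample.int(length(idx))]          # shuffle within the class
    cuts <- round(cumsum(probs) * length(idx))    # cumulative cut points
    cuts[length(cuts)] <- length(idx)             # guard against rounding
    part[idx] <- as.character(
      cut(seq_along(idx), breaks = c(0, cuts), labels = names(probs))
    )
  }
  part
}

winter$part <- stratified_split(winter$Medalist)

# Class proportions inside each partition.
round(prop.table(table(winter$part, winter$Medalist), margin = 1), 2)
```

The sketch assumes every class is large enough for the cut points to be distinct; with only a handful of rows in a class, `cut()` would fail on duplicated breaks.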

Missing values

## tibble [13,982 × 15] (S3: tbl_df/tbl/data.frame)
##  $ discipline_title    : Factor w/ 86 levels "3x3 Basketball",..: 68 13 49 36 68 2 21 70 70 21 ...
##  $ event_title         : chr [1:13982] "Giant parallel slalom men" "Women's 15km Individual" "Singles men" "Pairs mixed" ...
##  $ participant_type    : Factor w/ 2 levels "Athlete","GameTeam": 1 1 1 2 1 1 1 1 1 1 ...
##  $ game_location       : chr [1:13982] "ITA" "CHN" "USA" "CAN" ...
##  $ game_season         : Factor w/ 2 levels "Summer","Winter": 2 2 2 2 2 2 2 2 2 2 ...
##  $ HomeGame            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Medalist            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ athlete_year_birth  : num [1:13982] 1984 1997 1969 1986 1993 ...
##  $ debut_age           : num [1:13982] 22 25 23 20 21 23 21 19 23 20 ...
##  $ sex                 : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 1 2 1 2 1 ...
##  $ height              : int [1:13982] 185 NA 178 159 170 170 NA 164 178 168 ...
##  $ weight              : num [1:13982] 84 NA 88 44 62 73 NA 70 73 53 ...
##  $ NOC_code            : Factor w/ 230 levels "AFG","AHO","ALB",..: 82 184 133 174 192 38 113 109 217 109 ...
##  $ result_age          : num [1:13982] 22 25 33 24 21 23 25 23 27 20 ...
##  $ participation_number: int [1:13982] 1 1 4 2 1 1 2 2 2 1 ...
##              discipline_title event_title        participant_type
##  Cross Country Skiing:2901    Length:13982       Athlete :12749  
##  Alpine Skiing       :2797    Class :character   GameTeam: 1233  
##  Biathlon            :1969    Mode  :character                   
##  Speed skating       :1458                                       
##  Snowboard           : 788                                       
##  Freestyle Skiing    : 775                                       
##  (Other)             :3294                                       
##  game_location      game_season    HomeGame  Medalist  athlete_year_birth
##  Length:13982       Summer:    0   0:13220   0:12972   Min.   :1945      
##  Class :character   Winter:13982   1:  762   1: 1010   1st Qu.:1973      
##  Mode  :character                                      Median :1982      
##                                                        Mean   :1982      
##                                                        3rd Qu.:1990      
##                                                        Max.   :2006      
##                                                                          
##    debut_age         sex           height          weight          NOC_code   
##  Min.   :13.00   Female:5663   Min.   :136.0   Min.   : 30.00   USA    : 948  
##  1st Qu.:21.00   Male  :7810   1st Qu.:168.0   1st Qu.: 60.00   CAN    : 842  
##  Median :23.00   NA's  : 509   Median :174.0   Median : 68.00   ITA    : 741  
##  Mean   :23.28                 Mean   :174.3   Mean   : 69.39   GER    : 718  
##  3rd Qu.:25.00                 3rd Qu.:181.0   3rd Qu.: 78.00   FRA    : 663  
##  Max.   :50.00                 Max.   :216.0   Max.   :127.00   (Other):9560  
##                                NA's   :2086    NA's   :2086     NA's   : 510  
##    result_age    participation_number
##  Min.   : 0.00   Min.   :1.000       
##  1st Qu.:23.00   1st Qu.:1.000       
##  Median :26.00   Median :1.000       
##  Mean   :26.24   Mean   :1.675       
##  3rd Qu.:29.00   3rd Qu.:2.000       
##  Max.   :51.00   Max.   :8.000       
## 
##     discipline_title          event_title     participant_type 
##           0.00000000           0.00000000           0.00000000 
##        game_location          game_season             HomeGame 
##           0.00000000           0.00000000           0.00000000 
##             Medalist   athlete_year_birth            debut_age 
##           0.00000000           0.00000000           0.00000000 
##                  sex               height               weight 
##           0.03640395           0.14919182           0.14919182 
##             NOC_code           result_age participation_number 
##           0.03647547           0.00000000           0.00000000

Regarding the missing values, the NAs in the age variables (debut_age, result_age and athlete_year_birth) are all due to missing information in the source; since they account for less than 1% of the total, I simply won’t consider them. The NAs in NOC_code are not a problem, since I probably won’t use that variable in the algorithm. Sex, height and weight are also missing because the source lacks the information. In the summary above they account for roughly 4% (sex) and 15% (height and weight) of the rows, so I’ll impute height and weight with the sex-stratified median and sex with its modal class.
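The imputation described above could look roughly like the following base R sketch. The data frame `athletes`, its values and the helpers `factor_mode` and `impute_group_median` are made up for illustration; only the column names follow the report.

```r
# Toy data frame; column names mirror the report, the values are invented.
athletes <- data.frame(
  sex    = factor(c("Male", "Female", NA, "Male", "Female", "Male")),
  height = c(185, NA, 178, NA, 170, 181),
  weight = c(84, 60, NA, 88, NA, 79)
)

# Most frequent level of a factor; table() ignores NA by default (hypothetical helper).
factor_mode <- function(x) names(which.max(table(x)))

# 1. Impute sex with its modal class.
athletes$sex[is.na(athletes$sex)] <- factor_mode(athletes$sex)

# 2. Impute a numeric column with the median of the athlete's sex group (hypothetical helper).
impute_group_median <- function(x, group) {
  group_median <- ave(x, group, FUN = function(v) median(v, na.rm = TRUE))
  ifelse(is.na(x), group_median, x)
}
athletes$height <- impute_group_median(athletes$height, athletes$sex)
athletes$weight <- impute_group_median(athletes$weight, athletes$sex)
```

Sex is imputed first so that every row belongs to a group when the medians are computed. In a train/test setting it is usually safer to compute the modes and medians on the training data only and reuse them on the test data.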

Preliminary analysis

## P value debut_age 
## [1] 2.594079e-06
## P value height 
## [1] 0.9034867
## P value weight 
## [1] 0.1389711
## P value participation_number 
## [1] 8.943577e-28
## P value result_age 
## [1] 1.291988e-13
## P value participant_type 
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tbl
## X-squared = 75.521, df = 1, p-value < 2.2e-16
## 
## P value NOC_code
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = NaN, df = 229, p-value = NA
## 
## P value HomeGame 
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tbl
## X-squared = 7.84, df = 1, p-value = 0.00511
## 
## P value sex 
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tbl
## X-squared = 7.601, df = 1, p-value = 0.005834

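Given these results, the variables I’ll keep are debut_age, height, weight, participation_number, participant_type, sex and HomeGame; from now on I’ll consider only these variables. Note that height and weight are not significant on their own (p ≈ 0.90 and p ≈ 0.14), and that the test on NOC_code returns NaN, most likely because some of its 230 levels have no observations in this subset. In general there is no notable multicollinearity, except between height and weight and between debut_age and result_age.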
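As a rough sketch of how univariate checks like the ones above can be run in base R: the report does not say which test produced the p-values for the numeric variables, so the Wilcoxon rank-sum test used here is an assumption, and the data are simulated. The last two lines show how an unused factor level produces the NaN statistic seen for NOC_code.

```r
set.seed(1)
n <- 500
toy <- data.frame(
  Medalist  = factor(rbinom(n, 1, 0.2)),
  debut_age = rnorm(n, 23, 4),
  HomeGame  = factor(rbinom(n, 1, 0.3), levels = c(0, 1)),
  # "XXX" is a declared level that never occurs in the data
  NOC_code  = factor(sample(c("USA", "CAN", "ITA"), n, replace = TRUE),
                     levels = c("USA", "CAN", "ITA", "XXX"))
)

# Numeric predictor: compare its distribution between the two Medalist classes
# (the choice of test is an assumption, not taken from the report).
p_debut <- wilcox.test(debut_age ~ Medalist, data = toy)$p.value

# Categorical predictor: chi-squared test on the contingency table.
p_home <- chisq.test(table(toy$HomeGame, toy$Medalist))$p.value

# An unused level adds an all-zero row, so the expected counts are 0 and the statistic is NaN.
p_noc_raw <- suppressWarnings(chisq.test(table(toy$NOC_code, toy$Medalist))$p.value)

# Dropping unused levels first gives a defined result.
p_noc <- chisq.test(table(droplevels(toy$NOC_code), toy$Medalist))$p.value
```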

Let’s inspect the possible presence of outliers.

[Plots not reproduced: four sets of distribution plots. The first two sets cover debut_age, height, weight and participation_number; the last two cover debut_age and participation_number only.]

From the distribution plots there don’t seem to be any particular outliers. Furthermore, after the log transformations, all of the variables look approximately normal, and the variances and covariances seem to be shared by the classes.

It’s immediately apparent that the variables are distributed across significantly different scales. To compare them, it’s useful to standardize them all onto the same scale.
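A minimal sketch of the standardization step, assuming the usual z-score (subtract the mean, divide by the standard deviation). The `train` and `test` data frames are simulated, and reusing the training parameters on the test set is a common convention rather than something stated in the report.

```r
set.seed(42)
train <- data.frame(height = rnorm(100, 174, 9), weight = rnorm(100, 69, 12))
test  <- data.frame(height = rnorm(30, 174, 9),  weight = rnorm(30, 69, 12))

# Centre and scale the training set; scale() stores the parameters as attributes.
train_scaled <- scale(train)
centers <- attr(train_scaled, "scaled:center")
scales  <- attr(train_scaled, "scaled:scale")

# Apply the training mean and standard deviation to the test set.
test_scaled <- scale(test, center = centers, scale = scales)
```

After this step each training column has mean 0 and standard deviation 1, which is why the split points in a tree fitted on such data are small numbers around zero rather than raw centimetres or kilograms.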

Classification algorithm

Given that the aim of the project is mostly descriptive, I’ll only evaluate classification with logistic regression and with a decision tree.

Logistic regression

## 
## Call:
## glm(formula = Medalist ~ ., family = binomial, data = sub_train.w[, 
##     c(relevant_var.numeric.w, relevant_var.vategorical.w, "Medalist")])
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -2.5126429  0.0662234 -37.942  < 2e-16 ***
## debut_age                -0.1095616  0.0353328  -3.101  0.00193 ** 
## height                    0.0004107  0.0611965   0.007  0.99465    
## weight                    0.1528571  0.0570996   2.677  0.00743 ** 
## participation_number      0.3763478  0.0315043  11.946  < 2e-16 ***
## participant_typeGameTeam  0.8182707  0.0945714   8.652  < 2e-16 ***
## sexMale                  -0.4105080  0.0967102  -4.245 2.19e-05 ***
## HomeGame1                 0.3613143  0.1283715   2.815  0.00488 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7253.4  on 13981  degrees of freedom
## Residual deviance: 6987.6  on 13974  degrees of freedom
## AIC: 7003.6
## 
## Number of Fisher Scoring iterations: 5
## 
## Call:
## glm(formula = Medalist ~ debut_age + weight + participation_number + 
##     participant_type + sex + HomeGame, family = binomial, data = sub_train.w[, 
##     c(relevant_var.numeric.w, relevant_var.vategorical.w, "Medalist")])
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -2.51278    0.06310 -39.825  < 2e-16 ***
## debut_age                -0.10956    0.03533  -3.101 0.001928 ** 
## weight                    0.15311    0.04259   3.595 0.000324 ***
## participation_number      0.37635    0.03150  11.948  < 2e-16 ***
## participant_typeGameTeam  0.81821    0.09415   8.690  < 2e-16 ***
## sexMale                  -0.41027    0.08997  -4.560 5.11e-06 ***
## HomeGame1                 0.36129    0.12831   2.816 0.004867 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7253.4  on 13981  degrees of freedom
## Residual deviance: 6987.6  on 13975  degrees of freedom
## AIC: 7001.6
## 
## Number of Fisher Scoring iterations: 5

Let’s check whether there are any influential points that could introduce bias in the model.

Let’s remove those points and fit the model again.
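One common way to flag influential observations in a logistic regression is Cook's distance. The sketch below is an assumption about how such a step can be done, not the report's actual rule: the 4/n cutoff is a rule of thumb, the data are simulated, and only the name `influence_idx` is borrowed from the model calls shown below.

```r
set.seed(7)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))
x[1] <- 8; y[1] <- 0   # one artificial extreme point that the model fits badly
toy <- data.frame(x = x, y = y)

fit <- glm(y ~ x, family = binomial, data = toy)

# Cook's distance for each observation; 4/n is a rule-of-thumb cutoff (assumption).
cook <- cooks.distance(fit)
influence_idx <- which(cook > 4 / n)

# Refit without the flagged rows. Using setdiff() avoids the pitfall that
# toy[-integer(0), ] would return zero rows if nothing were flagged.
keep <- setdiff(seq_len(n), influence_idx)
fit_clean <- glm(y ~ x, family = binomial, data = toy[keep, ])
```

Comparing `coef(fit)` with `coef(fit_clean)` then shows how much the flagged points were pulling the estimates.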

## 
## Call:
## glm(formula = Medalist ~ ., family = binomial, data = sub_train.w[-influence_idx, 
##     c(relevant_var.numeric.w, relevant_var.vategorical.w, "Medalist")])
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -2.50363    0.06618 -37.832  < 2e-16 ***
## debut_age                -0.11900    0.03547  -3.355 0.000793 ***
## height                    0.01161    0.06131   0.189 0.849819    
## weight                    0.15559    0.05719   2.721 0.006515 ** 
## participation_number      0.37820    0.03159  11.972  < 2e-16 ***
## participant_typeGameTeam  0.82516    0.09486   8.699  < 2e-16 ***
## sexMale                  -0.43633    0.09694  -4.501 6.77e-06 ***
## HomeGame1                 0.33126    0.13047   2.539 0.011116 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7227.0  on 13975  degrees of freedom
## Residual deviance: 6957.5  on 13968  degrees of freedom
## AIC: 6973.5
## 
## Number of Fisher Scoring iterations: 5
## 
## Call:
## glm(formula = Medalist ~ debut_age + weight + participation_number + 
##     participant_type + sex + HomeGame, family = binomial, data = sub_train.w[-influence_idx, 
##     c(relevant_var.numeric.w, relevant_var.vategorical.w, "Medalist")])
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -2.50743    0.06310 -39.739  < 2e-16 ***
## debut_age                -0.11890    0.03546  -3.353 0.000799 ***
## weight                    0.16280    0.04263   3.819 0.000134 ***
## participation_number      0.37829    0.03159  11.976  < 2e-16 ***
## participant_typeGameTeam  0.82345    0.09443   8.720  < 2e-16 ***
## sexMale                  -0.42959    0.09019  -4.763  1.9e-06 ***
## HomeGame1                 0.33051    0.13040   2.535 0.011259 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7227.0  on 13975  degrees of freedom
## Residual deviance: 6957.5  on 13969  degrees of freedom
## AIC: 6971.5
## 
## Number of Fisher Scoring iterations: 5

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7026    0
##          1  504    0
##                                           
##                Accuracy : 0.9331          
##                  95% CI : (0.9272, 0.9386)
##     No Information Rate : 1               
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9331          
##             Specificity :     NA          
##          Pos Pred Value :     NA          
##          Neg Pred Value :     NA          
##              Prevalence : 1.0000          
##          Detection Rate : 0.9331          
##    Detection Prevalence : 0.9331          
##       Balanced Accuracy :     NA          
##                                           
##        'Positive' Class : 0               
## 

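As the results show, a threshold of 0.5 does not work at all: no observation is classified as a medalist. Let’s evaluate different threshold values and see which one performs best. Note that in these tables the row totals stay fixed at 7026 and 504 while the column totals change with the threshold, so the rows labelled Prediction appear to hold the observed classes and the columns labelled Reference the predicted ones; the derived statistics should be read with that in mind.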
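A threshold sweep of this kind can be sketched in base R as follows. The probabilities are simulated stand-ins for `predict(fit, type = "response")`, the helper `evaluate_threshold` is hypothetical, and the medalist class (1) is treated as the positive class here, unlike the output in this report, where class 0 is the positive class.

```r
set.seed(3)
n <- 2000
x <- rnorm(n)
prob <- plogis(-2.8 + 0.6 * x)   # stand-in for predicted probabilities
y <- rbinom(n, 1, prob)          # rare positive class, as with medals

# Sensitivity, specificity and balanced accuracy at one threshold (hypothetical helper).
evaluate_threshold <- function(prob, observed, threshold) {
  predicted <- as.integer(prob >= threshold)
  tp <- sum(predicted == 1 & observed == 1)
  tn <- sum(predicted == 0 & observed == 0)
  fp <- sum(predicted == 1 & observed == 0)
  fn <- sum(predicted == 0 & observed == 1)
  sensitivity <- tp / (tp + fn)   # share of medalists that are found
  specificity <- tn / (tn + fp)   # share of non-medalists correctly rejected
  c(threshold = threshold, sensitivity = sensitivity, specificity = specificity,
    balanced_accuracy = (sensitivity + specificity) / 2)
}

thresholds <- c(0.001, 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5)
results <- t(sapply(thresholds, function(th) evaluate_threshold(prob, y, th)))
best <- results[which.max(results[, "balanced_accuracy"]), "threshold"]
```

With a rare class, plain accuracy rewards always predicting 0, so balanced accuracy (or sensitivity and specificity read together) is a more informative criterion for choosing the threshold.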

## [1] 0.001
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0    0 7026
##          1    0  504
##                                           
##                Accuracy : 0.0669          
##                  95% CI : (0.0614, 0.0728)
##     No Information Rate : 1               
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity :      NA         
##             Specificity : 0.06693         
##          Pos Pred Value :      NA         
##          Neg Pred Value :      NA         
##              Prevalence : 0.00000         
##          Detection Rate : 0.00000         
##    Detection Prevalence : 0.93307         
##       Balanced Accuracy :      NA         
##                                           
##        'Positive' Class : 0               
##                                           
## [1] 0.05
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2689 4337
##          1  103  401
##                                           
##                Accuracy : 0.4104          
##                  95% CI : (0.3992, 0.4216)
##     No Information Rate : 0.6292          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0364          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.96311         
##             Specificity : 0.08463         
##          Pos Pred Value : 0.38272         
##          Neg Pred Value : 0.79563         
##              Prevalence : 0.37078         
##          Detection Rate : 0.35710         
##    Detection Prevalence : 0.93307         
##       Balanced Accuracy : 0.52387         
##                                           
##        'Positive' Class : 0               
##                                           
## [1] 0.1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5820 1206
##          1  327  177
##                                           
##                Accuracy : 0.7964          
##                  95% CI : (0.7871, 0.8055)
##     No Information Rate : 0.8163          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0992          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9468          
##             Specificity : 0.1280          
##          Pos Pred Value : 0.8284          
##          Neg Pred Value : 0.3512          
##              Prevalence : 0.8163          
##          Detection Rate : 0.7729          
##    Detection Prevalence : 0.9331          
##       Balanced Accuracy : 0.5374          
##                                           
##        'Positive' Class : 0               
##                                           
## [1] 0.15
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6746  280
##          1  445   59
##                                           
##                Accuracy : 0.9037          
##                  95% CI : (0.8968, 0.9103)
##     No Information Rate : 0.955           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.091           
##                                           
##  Mcnemar's Test P-Value : 1.123e-09       
##                                           
##             Sensitivity : 0.9381          
##             Specificity : 0.1740          
##          Pos Pred Value : 0.9601          
##          Neg Pred Value : 0.1171          
##              Prevalence : 0.9550          
##          Detection Rate : 0.8959          
##    Detection Prevalence : 0.9331          
##       Balanced Accuracy : 0.5561          
##                                           
##        'Positive' Class : 0               
##                                           
## [1] 0.2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6949   77
##          1  492   12
##                                           
##                Accuracy : 0.9244          
##                  95% CI : (0.9182, 0.9303)
##     No Information Rate : 0.9882          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0208          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.93388         
##             Specificity : 0.13483         
##          Pos Pred Value : 0.98904         
##          Neg Pred Value : 0.02381         
##              Prevalence : 0.98818         
##          Detection Rate : 0.92284         
##    Detection Prevalence : 0.93307         
##       Balanced Accuracy : 0.53436         
##                                           
##        'Positive' Class : 0               
##                                           
## [1] 0.3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7024    2
##          1  502    2
##                                           
##                Accuracy : 0.9331          
##                  95% CI : (0.9272, 0.9386)
##     No Information Rate : 0.9995          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0068          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.933298        
##             Specificity : 0.500000        
##          Pos Pred Value : 0.999715        
##          Neg Pred Value : 0.003968        
##              Prevalence : 0.999469        
##          Detection Rate : 0.932802        
##    Detection Prevalence : 0.933068        
##       Balanced Accuracy : 0.716649        
##                                           
##        'Positive' Class : 0               
##                                           
## [1] 0.4
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7026    0
##          1  504    0
##                                           
##                Accuracy : 0.9331          
##                  95% CI : (0.9272, 0.9386)
##     No Information Rate : 1               
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9331          
##             Specificity :     NA          
##          Pos Pred Value :     NA          
##          Neg Pred Value :     NA          
##              Prevalence : 1.0000          
##          Detection Rate : 0.9331          
##    Detection Prevalence : 0.9331          
##       Balanced Accuracy :     NA          
##                                           
##        'Positive' Class : 0               
##                                           
## [1] 0.5
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7026    0
##          1  504    0
##                                           
##                Accuracy : 0.9331          
##                  95% CI : (0.9272, 0.9386)
##     No Information Rate : 1               
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9331          
##             Specificity :     NA          
##          Pos Pred Value :     NA          
##          Neg Pred Value :     NA          
##              Prevalence : 1.0000          
##          Detection Rate : 0.9331          
##    Detection Prevalence : 0.9331          
##       Balanced Accuracy :     NA          
##                                           
##        'Positive' Class : 0               
## 

The best threshold seems to be 0.3, which has the highest balanced accuracy (0.717) among the values tried.

Tree

## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 13982 7253.00 0 ( 0.927764 0.072236 )  
##     2) participation_number < -0.100224 7737 2966.00 0 ( 0.952307 0.047693 )  
##       4) participant_type: Athlete 7016 2503.00 0 ( 0.956670 0.043330 ) *
##       5) participant_type: GameTeam 721  436.80 0 ( 0.909847 0.090153 ) *
##     3) participation_number > -0.100224 6245 4132.00 0 ( 0.897358 0.102642 )  
##       6) participant_type: Athlete 5733 3583.00 0 ( 0.905634 0.094366 )  
##        12) debut_age < -0.151209 3092 2184.00 0 ( 0.886805 0.113195 )  
##          24) height < 0.109535 1748 1373.00 0 ( 0.866705 0.133295 ) *
##          25) height > 0.109535 1344  794.80 0 ( 0.912946 0.087054 ) *
##        13) debut_age > -0.151209 2641 1371.00 0 ( 0.927679 0.072321 )  
##          26) weight < 1.3013 2328 1097.00 0 ( 0.936856 0.063144 )  
##            52) debut_age < 0.937697 1829  927.70 0 ( 0.930016 0.069984 ) *
##            53) debut_age > 0.937697 499  161.50 0 ( 0.961924 0.038076 )  
##             106) weight < 0.0839942 293  129.80 0 ( 0.941980 0.058020 ) *
##             107) weight > 0.0839942 206   22.52 0 ( 0.990291 0.009709 ) *
##          27) weight > 1.3013 313  254.20 0 ( 0.859425 0.140575 ) *
##       7) participant_type: GameTeam 512  505.70 0 ( 0.804688 0.195312 ) *

Given the strong class imbalance (winning a medal is very rare), the decision tree does not perform well at all: it always predicts class 0. I therefore proceed to evaluate and confirm the performance of the logistic regression.
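The reason the tree always predicts 0 can be read directly from the terminal nodes printed above: a leaf predicts the majority class, and no leaf has a medalist share of 0.5 or more. The short sketch below only re-uses the numbers from that printout; the alternative cutoff at the overall medal rate is a hypothetical illustration, not something done in the report.

```r
# Medalist probability in each terminal node, copied from the tree printed above
# (names are the node numbers).
leaf_prob <- c(`4` = 0.043330, `5` = 0.090153, `24` = 0.133295, `25` = 0.087054,
               `52` = 0.069984, `106` = 0.058020, `107` = 0.009709,
               `27` = 0.140575, `7` = 0.195312)

# Majority vote is equivalent to a 0.5 cutoff: no leaf reaches it.
any(leaf_prob >= 0.5)   # FALSE

# Hypothetical alternative: flag the leaves whose medal rate exceeds the root-node rate.
prevalence <- 0.072236
flagged <- names(leaf_prob)[leaf_prob > prevalence]
```

So the tree does separate groups with different medal rates (from about 1% to about 20%); it is the default majority-vote rule that collapses every prediction to 0.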

Evaluation

Train

## 
## Call:
## glm(formula = Medalist ~ debut_age + weight + participation_number + 
##     participant_type + sex + HomeGame, family = binomial, data = train.w[, 
##     c(relevant_var.numeric.w, relevant_var.vategorical.w, "Medalist")])
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -17.48071    3.87564  -4.510 6.47e-06 ***
## debut_age                 -2.84465    0.68815  -4.134 3.57e-05 ***
## weight                     0.12752    0.03505   3.638 0.000275 ***
## participation_number       0.73669    0.04988  14.770  < 2e-16 ***
## participant_typeGameTeam   0.83708    0.07644  10.950  < 2e-16 ***
## sexMale                   -0.39378    0.07346  -5.361 8.30e-08 ***
## HomeGame1                  0.40354    0.10279   3.926 8.64e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10955  on 21511  degrees of freedom
## Residual deviance: 10551  on 21505  degrees of freedom
## AIC: 10565
## 
## Number of Fisher Scoring iterations: 5

## 
## Call:
## glm(formula = Medalist ~ debut_age + weight + participation_number + 
##     participant_type + sex + HomeGame, family = binomial, data = train.w[-influence_idx, 
##     c(relevant_var.numeric.w, relevant_var.vategorical.w, "Medalist")])
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -18.36794    3.88542  -4.727 2.27e-06 ***
## debut_age                 -3.00408    0.68987  -4.355 1.33e-05 ***
## weight                     0.13338    0.03508   3.802 0.000143 ***
## participation_number       0.73922    0.04997  14.793  < 2e-16 ***
## participant_typeGameTeam   0.84112    0.07660  10.981  < 2e-16 ***
## sexMale                   -0.40586    0.07358  -5.516 3.47e-08 ***
## HomeGame1                  0.38447    0.10385   3.702 0.000214 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10928  on 21505  degrees of freedom
## Residual deviance: 10521  on 21499  degrees of freedom
## AIC: 10535
## 
## Number of Fisher Scoring iterations: 5

Test

##              discipline_title event_title        participant_type
##  Alpine Skiing       :1121    Length:5378        Athlete :4892   
##  Cross Country Skiing:1110    Class :character   GameTeam: 486   
##  Biathlon            : 741    Mode  :character                   
##  Speed skating       : 521                                       
##  Snowboard           : 342                                       
##  Freestyle Skiing    : 315                                       
##  (Other)             :1228                                       
##  game_location      game_season   HomeGame Medalist athlete_year_birth
##  Length:5378        Summer:   0   0:5063   0:4972   Min.   :1950      
##  Class :character   Winter:5378   1: 315   1: 406   1st Qu.:1973      
##  Mode  :character                                   Median :1982      
##                                                     Mean   :1982      
##                                                     3rd Qu.:1990      
##                                                     Max.   :2008      
##                                                                       
##    debut_age         sex           height          weight          NOC_code   
##  Min.   :14.00   Female:2233   Min.   :136.0   Min.   : 30.00   USA    : 394  
##  1st Qu.:21.00   Male  :2974   1st Qu.:168.0   1st Qu.: 60.00   CAN    : 328  
##  Median :23.00   NA's  : 171   Median :174.0   Median : 68.00   ITA    : 286  
##  Mean   :23.34                 Mean   :174.2   Mean   : 69.44   GER    : 256  
##  3rd Qu.:25.00                 3rd Qu.:180.0   3rd Qu.: 78.00   FRA    : 241  
##  Max.   :47.00                 Max.   :204.0   Max.   :125.00   (Other):3702  
##                                NA's   :790     NA's   :790      NA's   : 171  
##    result_age    participation_number
##  Min.   :14.00   Min.   :1.00        
##  1st Qu.:23.00   1st Qu.:1.00        
##  Median :26.00   Median :1.00        
##  Mean   :26.21   Mean   :1.66        
##  3rd Qu.:29.00   3rd Qu.:2.00        
##  Max.   :51.00   Max.   :8.00        
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4965    7
##          1  405    1
##                                          
##                Accuracy : 0.9234         
##                  95% CI : (0.916, 0.9304)
##     No Information Rate : 0.9985         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0019         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.924581       
##             Specificity : 0.125000       
##          Pos Pred Value : 0.998592       
##          Neg Pred Value : 0.002463       
##              Prevalence : 0.998512       
##          Detection Rate : 0.923206       
##    Detection Prevalence : 0.924507       
##       Balanced Accuracy : 0.524791       
##                                          
##        'Positive' Class : 0              
## 
## [[1]]
## [1] 0.6524198

The model doesn’t seem to perform very well, but it can give us an idea of what it takes to win a medal. In particular, it highlights that winning an Olympic medal is not a matter of chance but the result of several measurable factors. Among these, the most remarkable are:
- Athletes debuting at a younger age have a higher probability of reaching the podium, reflecting the importance of early specialization and long-term career development;
- The number of participations is one of the strongest predictors: experience accumulated across multiple Games significantly increases the chances of success;
- Being part of a GameTeam rather than competing individually also improves the odds, as collective events tend to guarantee more stable performances;
- Competing at home (HomeGame) offers a tangible advantage, confirming the existence of a “home effect” in the Olympic Games.
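To make the size of these effects easier to read, the logistic regression coefficients can be turned into odds ratios by exponentiating them. The sketch below copies the estimates from the second training summary above (the fit without the flagged points); since the numeric predictors were transformed and standardized, their odds ratios refer to one unit on that transformed scale, not to one year or one kilogram.

```r
# Estimates copied from the training summary fitted without the influential points.
coefs <- c(debut_age = -3.00408, weight = 0.13338, participation_number = 0.73922,
           participant_typeGameTeam = 0.84112, sexMale = -0.40586, HomeGame1 = 0.38447)

# exp(beta) is the multiplicative change in the odds of a medal
# for a one-unit increase in the predictor, holding the others fixed.
odds_ratios <- exp(coefs)
round(odds_ratios, 2)
```

An odds ratio above 1 (GameTeam, HomeGame, participation_number, weight) means higher odds of a medal, while a value below 1 (debut_age, sexMale) means lower odds.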